Compare two versions of a website
Websites evolve; their content changes over time. The Wayback Machine is a crawler that runs periodically to automatically archive websites. Every time it crawls a website, it creates a snapshot of that website at that moment in time. This snapshot trail can show you what changed on the website between two timestamps.
This tutorial shows you how to do these tasks:
Retrieve a list of all available versions of a website.
Compose the URLs for the versions to compare.
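API used#
This tutorial uses the Wayback CDX Server API, which is served from http://web.archive.org/cdx/search/cdx.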
Prerequisites#
The instructions in this tutorial use the cURL command-line tool. Most computers have it pre-installed. To check whether it's installed on your computer, run the following command at the command prompt:
curl
You should get an output similar to this:
curl: try 'curl --help' for more information
If you don't see this output, install cURL.
Steps#
This task has two steps.
Step 1. Get a list of available snapshots#
Run a command with the following syntax:
curl -X GET "http://web.archive.org/cdx/search/cdx?url=<URL>"
where <URL> is the URL of the website whose snapshots you're retrieving.
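The same request can also be composed in a script. The following is a minimal Python sketch that builds the query URL with the standard library. The helper name build_cdx_query is hypothetical, and the block only constructs the URL; it doesn't send the request.

```python
from urllib.parse import urlencode

# Endpoint taken from the command syntax shown above.
CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"


def build_cdx_query(site_url: str) -> str:
    """Return the CDX search URL for a website (hypothetical helper)."""
    # urlencode percent-encodes characters that are unsafe in a query string.
    return f"{CDX_ENDPOINT}?{urlencode({'url': site_url})}"


print(build_cdx_query("tc.eserver.org"))
```

You can pass the printed URL to curl, or to any HTTP client, in place of the hand-written one.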
Each line of the result has the following fields, separated by a single space:
urlkey: A canonical transformation of the URL you supplied, for example, org,eserver,tc)/. Such keys are useful for indexing.
timestamp: A 14-digit date-time representation in the YYYYMMDDhhmmss format.
original: The originally archived URL, which could be different from the URL you supplied.
mimetype: The MIME type of the archived content, for example, text/html or warc/revisit.
statuscode: The HTTP status code of the snapshot. If the mimetype is warc/revisit, the value returned for statuscode can be blank, but the actual value is the same as that of any other entry that has the same digest as this entry.
digest: The SHA1 hash digest of the content, excluding the headers. It's usually a base-32-encoded string.
length: The compressed byte size of the corresponding WARC record, which includes WARC headers, HTTP headers, and content payload.
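To work with these fields in a script, each response line can be split into named values. The following Python sketch assumes that the fields arrive in the order listed above and that every line has exactly seven of them. The names CdxRecord and parse_cdx_line are hypothetical; the sample line comes from the example response in this tutorial.

```python
from typing import NamedTuple


class CdxRecord(NamedTuple):
    """One line of the CDX response (hypothetical container)."""
    urlkey: str
    timestamp: str
    original: str
    mimetype: str
    statuscode: str
    digest: str
    length: str


def parse_cdx_line(line: str) -> CdxRecord:
    """Split one space-separated response line into its seven fields."""
    fields = line.strip().split(" ")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    return CdxRecord(*fields)


record = parse_cdx_line(
    "org,eserver,tc)/ 20180515033912 http://tc.eserver.org:80/ "
    "text/html 302 RK36SX4X6VJ44FMUWDK4QYFPYGBYUJUH 404"
)
print(record.timestamp, record.statuscode, record.digest)
```

All values are kept as strings so that a placeholder such as - in the statuscode column doesn't cause a conversion error.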
Example request#
curl -X GET "http://web.archive.org/cdx/search/cdx?url=tc.eserver.org"
Example response#
org,eserver,tc)/ 20180515033912 http://tc.eserver.org:80/ text/html 302 RK36SX4X6VJ44FMUWDK4QYFPYGBYUJUH 404
org,eserver,tc)/ 20180716082607 http://tc.eserver.org:80/ text/html 302 RK36SX4X6VJ44FMUWDK4QYFPYGBYUJUH 405
org,eserver,tc)/ 20180915160723 http://tc.eserver.org:80/ text/html 302 RK36SX4X6VJ44FMUWDK4QYFPYGBYUJUH 404
org,eserver,tc)/ 20181014163006 http://tc.eserver.org/ warc/revisit - RK36SX4X6VJ44FMUWDK4QYFPYGBYUJUH 502
org,eserver,tc)/ 20181115172501 http://tc.eserver.org:80/ text/html 302 RK36SX4X6VJ44FMUWDK4QYFPYGBYUJUH 404
org,eserver,tc)/ 20181228210547 http://tc.eserver.org/ warc/revisit - RK36SX4X6VJ44FMUWDK4QYFPYGBYUJUH 500
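In the example response, the warc/revisit entries show - in place of a status code. As described in the field list, the actual value matches that of any entry with the same digest. The following Python sketch fills in those values. It assumes that - is the placeholder, as seen in the example response, and the function name resolve_status_codes is hypothetical.

```python
def resolve_status_codes(lines):
    """Return (timestamp, statuscode) pairs, filling "-" codes by digest."""
    rows = [line.split(" ") for line in lines]

    # Remember one recorded status code for each digest.
    known = {}
    for row in rows:
        status, digest = row[4], row[5]
        if status != "-":
            known.setdefault(digest, status)

    resolved = []
    for row in rows:
        timestamp, status, digest = row[1], row[4], row[5]
        if status == "-":
            # Leave the placeholder in place if no entry shares the digest.
            status = known.get(digest, status)
        resolved.append((timestamp, status))
    return resolved


# Two lines copied from the example response above.
sample = [
    "org,eserver,tc)/ 20180915160723 http://tc.eserver.org:80/ text/html 302 RK36SX4X6VJ44FMUWDK4QYFPYGBYUJUH 404",
    "org,eserver,tc)/ 20181014163006 http://tc.eserver.org/ warc/revisit - RK36SX4X6VJ44FMUWDK4QYFPYGBYUJUH 502",
]
for timestamp, status in resolve_status_codes(sample):
    print(timestamp, status)
```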
Step 2. Compare the website versions#
The URL of a snapshot archived by the Wayback Machine starts with the following prefix: http://web.archive.org/web/<timestamp>/. So, for example, if a snapshot of the website at tc.eserver.org/ was archived on 27 April 2018 at 13:06:34 hrs, the URL of the snapshot is http://web.archive.org/web/20180427130634/https://tc.eserver.org/.
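The URL pattern above can be expressed as a small helper. The following Python sketch checks that the timestamp is a valid 14-digit YYYYMMDDhhmmss value and then joins it with the prefix. The names snapshot_url and parse_timestamp are hypothetical.

```python
from datetime import datetime

# Prefix taken from the URL pattern described above.
WAYBACK_PREFIX = "http://web.archive.org/web"


def parse_timestamp(timestamp: str) -> datetime:
    """Convert a 14-digit YYYYMMDDhhmmss string to a datetime."""
    if len(timestamp) != 14 or not timestamp.isdigit():
        raise ValueError(f"not a 14-digit timestamp: {timestamp!r}")
    # strptime also rejects impossible dates such as month 13.
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def snapshot_url(timestamp: str, original_url: str) -> str:
    """Compose the URL of the snapshot taken at the given timestamp."""
    parse_timestamp(timestamp)  # validate before composing the URL
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


print(snapshot_url("20180427130634", "https://tc.eserver.org/"))
print(parse_timestamp("20180427130634").isoformat())
```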
From the list you generated in the previous step, pick two timestamps, and compose their URLs. For example, http://web.archive.org/web/20180427130634/https://tc.eserver.org/ and http://web.archive.org/web/20181115172501/https://tc.eserver.org/.
Open your favourite diff tool, and use it to compare the two versions.
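If you prefer a script to a diff tool, Python's standard difflib module can produce a unified diff of two pages. The following is a minimal sketch, not a replacement for a full diff tool. The names fetch_text and diff_versions are hypothetical, fetch_text assumes network access and UTF-8 content, and the two short HTML strings are placeholders so that the comparison runs offline.

```python
import difflib
from urllib.request import urlopen


def fetch_text(url: str) -> str:
    """Download a page and decode it as UTF-8 (requires network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def diff_versions(old_text: str, new_text: str,
                  old_label: str = "old", new_label: str = "new") -> str:
    """Return a unified diff of two page sources, or "" if they match."""
    diff_lines = difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    )
    return "\n".join(diff_lines)


# Placeholder page sources; in practice, pass the result of
# fetch_text(<snapshot URL>) for each of the two versions.
old_page = "<h1>Welcome</h1>\n<p>Version 1</p>"
new_page = "<h1>Welcome</h1>\n<p>Version 2</p>"
print(diff_versions(old_page, new_page))
```

Replayed pages can include markup added by the archive, so some of the reported differences might not come from the original website.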
If you don't see any difference, the two snapshots might have the same digest. If so, pick two snapshots that have different digests, and compare them.
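To find snapshots whose content actually differs, you can group the response lines by digest. The following Python sketch keeps the first listed timestamp for each digest. The function name first_snapshot_per_digest is hypothetical, and the sample lines come from the example response in this tutorial.

```python
def first_snapshot_per_digest(lines):
    """Map each digest to the timestamp of its first listed snapshot."""
    first_seen = {}
    for line in lines:
        fields = line.split(" ")
        timestamp, digest = fields[1], fields[5]
        # setdefault keeps the earliest entry and ignores later repeats.
        first_seen.setdefault(digest, timestamp)
    return first_seen


sample = [
    "org,eserver,tc)/ 20180515033912 http://tc.eserver.org:80/ text/html 302 RK36SX4X6VJ44FMUWDK4QYFPYGBYUJUH 404",
    "org,eserver,tc)/ 20180716082607 http://tc.eserver.org:80/ text/html 302 RK36SX4X6VJ44FMUWDK4QYFPYGBYUJUH 405",
    "org,eserver,tc)/ 20181014163006 http://tc.eserver.org/ warc/revisit - RK36SX4X6VJ44FMUWDK4QYFPYGBYUJUH 502",
]
distinct = first_snapshot_per_digest(sample)
print(len(distinct), "distinct digest(s):", distinct)
```

If the result contains only one digest, as it does for these sample lines, the listed snapshots have identical content, and you need snapshots from a wider time range to see a difference.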
Wayback Changes#
“Wayback Changes” is a tool you can use to identify and display changes in the content of archives of URLs.
To access it, use the following URL syntax: https://web.archive.org/web/changes/<URL>.
First, select two different archives of a URL in an interface that shows the degree of relative change from one archive to another.
Then, view the replay of the two archives you selected, side by side, with changes highlighted in blue and yellow.